Indexing and searching strategies for the Russian language
Authors
Abstract
This paper describes and evaluates various stemming and indexing strategies for the Russian language. We design and evaluate two stemming approaches, a light one and a more aggressive one, and compare these stemmers to the Snowball stemmer, to no stemming, and to a language-independent approach (n-gram). To evaluate the suggested stemming strategies, we apply various probabilistic information retrieval (IR) models, including Okapi, Divergence from Randomness (DFR), and a statistical language model (LM), as well as two vector-space approaches, namely the classical tf-idf scheme and the dtu-dtn model. We find that the vector-space dtu-dtn and the DFR models tend to result in better retrieval effectiveness than the Okapi, LM, or tf-idf models, while only the latter two IR approaches result in statistically significant performance differences. Ignoring stemming generally reduces the mean average precision (MAP) by more than 50%, and these differences are always significant. When applying an n-gram approach, performance is usually lower than with an approach involving stemming. Finally, our light stemmer tends to perform best, although performance differences between the light, aggressive, and Snowball stemmers are not statistically significant.
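As a rough illustration of two notions the abstract relies on, the sketch below shows how a word can be split into overlapping character n-grams (the language-independent indexing unit) and how MAP is computed from ranked result lists. The abstract does not state the n-gram length or the evaluation tooling, so the default `n=4`, the function names, and the toy document ids are all assumptions made for this example, not the authors' setup.

```python
from typing import List, Sequence, Set, Tuple


def char_ngrams(word: str, n: int = 4) -> List[str]:
    """Return the overlapping character n-grams of a word.

    n=4 is an illustrative default, not a value taken from the paper.
    Words no longer than n are kept whole so they still yield one unit.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list of document ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank, counted only at relevant documents.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs: Sequence[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of the per-query average precision values."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    # Hypothetical word and toy rankings, for illustration only.
    print(char_ngrams("книга"))
    toy_runs = [
        (["d1", "d2", "d3"], {"d1", "d3"}),
        (["d4", "d5"], {"d5"}),
    ]
    print(round(mean_average_precision(toy_runs), 4))
```

Because n-grams are cut from the surface form without any knowledge of Russian morphology, the same code works for any language, which is what makes the approach language-independent.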
Journal: JASIST
Volume: 60
Publication year: 2009